Incorporating Hyperlink Analysis in Web Page Clustering

Authors

  • Michael Chau
  • Patrick Y. K. Chau
  • Paul J. Hu
Abstract

The World Wide Web is growing rapidly and has become a very important source of information for a variety of academic and commercial applications. However, because of the large number of documents online, it is becoming increasingly difficult to search for useful information on the Web. General-purpose Web search engines, such as Google and AltaVista, present search results as ranked lists. Such ranked lists show users only the first few documents of the search results and fail to give them a quick overview of the retrieved document set. To address this problem, clustering techniques are often used to group documents into different topics. While traditional clustering algorithms have been applied to Web page clustering, they do not make use of the unique characteristics of the Web, such as its hyperlink structure. In this study, we propose to incorporate hyperlink analysis into the traditional vector space model used in document clustering. Specifically, we will introduce a new metric, HFIDF, based on link analysis, to be used together with the traditional TFIDF (term frequency multiplied by inverse document frequency) in the similarity measure of clustering algorithms. The proposed study will investigate whether the use of Web structure analysis techniques improves the performance of document clustering in presenting Web search results.
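The abstract does not define HFIDF, so the following Python sketch is only an illustration of the general idea, not the authors' metric. It assumes, by analogy with TFIDF, that each page's in-linking pages can be treated as features and weighted the same way as terms, and that text and link similarity are blended with a hypothetical weight `alpha`. The function names, the toy pages, and the blending rule are all assumptions made for this example.

```python
import math
from collections import Counter


def weighted_vectors(docs):
    """Build frequency * inverse-document-frequency vectors.

    docs maps a page name to a list of features. The features may be
    terms (classic TFIDF) or ids of in-linking pages (an assumed,
    illustrative stand-in for a link-based weight such as HFIDF).
    """
    n = len(docs)
    df = Counter()
    for feats in docs.values():
        df.update(set(feats))
    vectors = {}
    for name, feats in docs.items():
        counts = Counter(feats)
        vectors[name] = {f: c * math.log(n / df[f]) for f, c in counts.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combined_similarity(a, b, text_vecs, link_vecs, alpha=0.5):
    """Blend text and link similarity; alpha is a hypothetical mixing weight."""
    text_sim = cosine(text_vecs[a], text_vecs[b])
    link_sim = cosine(link_vecs[a], link_vecs[b])
    return alpha * text_sim + (1 - alpha) * link_sim


# Toy data (invented for illustration): terms and in-linking pages per page.
page_terms = {
    "p1": ["web", "clustering", "search"],
    "p2": ["web", "clustering", "links"],
    "p3": ["cooking", "recipes", "pasta"],
}
page_inlinks = {
    "p1": ["hubA", "hubB"],
    "p2": ["hubA", "hubB"],
    "p3": ["hubC"],
}

text_vecs = weighted_vectors(page_terms)
link_vecs = weighted_vectors(page_inlinks)

for a, b in [("p1", "p2"), ("p1", "p3")]:
    print(a, b, round(combined_similarity(a, b, text_vecs, link_vecs), 3))
```

In this toy setup, p1 and p2 share only part of their vocabulary but are linked to by the same pages, so the blended score is higher than the text-only cosine. A clustering algorithm would consume such a similarity in place of the plain TFIDF cosine; whether that helps in practice is the question the proposed study sets out to investigate.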

Similar resources

On Intra-page and Inter-page Semantic Analysis of Web Pages

To make real Web information more machine-processable, this paper presents a new approach to intra-page and inter-page semantic analysis of Web pages. Our approach consists of Web page structure analysis and semantic clustering for intra-page semantic analysis, and machine-learning-based link semantic analysis for inter-page analysis. Based on the automatic repetitive patterns discovery in str...

Utilizing Hyperlink Transitivity to Improve Web Page Clustering

The rapid increase in the complexity and size of the web leaves web search results far from satisfactory in many cases, because of the huge amount of information returned by search engines. How to find intrinsic relationships among web pages at a higher level, in order to manage and retrieve web search results efficiently, is becoming a challenging problem. In this paper, we propose an approach to mea...

A Survey Paper of Structure Mining Technique using Clustering and Ranking Algorithm

A survey of various link analysis and clustering algorithms such as Page Rank, Hyperlink-Induced Topic Search, Weighted Page Rank based on Visit of Links, K-Means, and Fuzzy K-Means. Among the ranking algorithms illustrated, Weighted Page Rank is more efficient than Hyperlink-Induced Topic Search, whereas among the clustering algorithms described, Fuzzy Soft Rough K-Means is a mixture of Rough K-Means and fuzzy soft...

Vision-Based Deep Web Data Extraction for Web Document Clustering

The design of web information extraction systems is becoming more complex and time-consuming. Detecting the data region is a significant problem in extracting information from a web page. In this paper, an approach to vision-based deep web data extraction is proposed for web document clustering. The proposed approach comprises two phases: 1) Vision-based web data extraction, and 2) web documen...

A Survey on Clustering Algorithms for web Applications

Web page clustering techniques categorize and organize search results into semantically meaningful clusters that help users find relevant information quickly. In general, clustering provides a solution for data management, information locating, and interpretation of web data. It also helps users discriminate among, navigate, and organize web pages. Finding information on the World Wide Web is...

Publication date: 2007